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Abstract 

The Ising Model has recently received much attention for 
the statistical description of neural spike train data. In this 
paper, we propose and demonstrate its use for building de- 
coders capable of predicting, on a millisecond timescale, 
the stimulus represented by a pattern of neural activity. 
After fitting to a training dataset, the Ising decoder can 
be applied "online" for instantaneous decoding of test 
data. While such models can be fit exactly using Boltz- 
mann learning, this approach rapidly becomes compu- 
tationally intractable as neural ensemble size increases. 
We show that several approaches, including the Thouless- 
Anderson-Palmer (TAP) mean field approach from statis- 
tical physics, and the recently developed Minimum Prob- 
ability Flow Learning (MPFL) algorithm, can be used for 
rapid inference of model parameters in large-scale neural 
ensembles. Use of the Ising model for decoding, unlike 
other problems such as functional connectivity estima- 
tion, requires estimation of the partition function. As this 
involves summation over all possible responses, this step 
can be limiting. Mean field approaches avoid this prob- 
lem by providing an analytical expression for the parti- 
tion function. We demonstrate these decoding techniques 
by applying them to simulated neural ensemble responses 
from a mouse visual cortex model, finding an improve- 
ment in decoder performance for a model with heteroge- 
neous as opposed to homogeneous neural tuning and re- 
sponse properties. Our results demonstrate the practical- 
ity of using the Ising model to read out, or decode, spatial 



patterns of activity comprised of many hundreds of neu- 
rons. 



1 Introduction 

Interpreting the patterns of activity fired by populations 
of neurons is one of the central challenges of modern sys- 
tems neuroscience. The design of decoding algorithms 
capable of millisecond-by-millisecond readout of sensory 
or behavioural correlates of neuronal activity patterns 
would be a valuable step in this direction. Such decoding 
algorithms, as well as helping us to understand the neural 
code, may have further practical application, as the basis 
of communication neural prostheses for severely disabled 
patients such as those with "Locked In" syndrome. 

At the heart of such a decoding algorithm must lie - 
whether explicit or implicit - a description of the con- 
ditional probability distribution of activity patterns given 
stimuli or behaviours. Making this description is nontriv- 
ial, as the brain, like other biological systems, exhibits 
enormous complexity. This results in a very large number 
of possible states or configurations exhibited by the sys- 
tem, making the description of such systems by simply 
measuring the probabilities of each state unfeasible. Ex- 
cept for very small patterns, a model-based approach of 
some kind is essential. 

New technologies in neuroscience such as high-density 
multi-electrode array recording and multi-photon calcium 
imaging now make it possible to monitor the activity of 
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large numbers of neurons simultaneously. Analysis tools 
for such high dimensional data have however lagged be- 
hind the experimental technology, as most approaches are 
limited to very small population sizes. While consider- 
able advances have been made in the use of information- 
theoretic approaches to characterise the statistical struc- 
ture of small neural ensembles (Gawne et al. 1996, Panz- 
eri, Schultz, Treves & Rolls 1999, Schultz & Panzeri 
2001, Panzeri & Schultz 2001, Reich et al. 2001, Petersen 
et al. 2001, Pola et al. 2003, Montani et al. 2007), finite 
sampling limitations have made results for larger ensem- 
bles much more difficult to obtain. 

For the statistical description of multivariate neural 
spike train data, parametric models able to capture most 
of the interesting features of real data while still being 
of empirically accessible dimensionality are highly desir- 
able. One promising approach has emerged from statis- 
tical mechanics: the use of Ising (or Ising-like) models, 
exploiting an analogy between populations of spike trains 
and ensembles of interacting magnetic spins (Shlens et al. 
2006, Shlens et al. 2009, Schneidman et al. 2006). 

Our aim here is to devise an algorithm for "millisecond- 
by-millisecond" neural decoding, on the basis that in- 
formation processing in the nervous system appears to 
make use of such fine temporal scales (Carr 1993, Bair 
& Koch 1996). The timescale of the "symbols" used in 
information processing is thus likely to be somewhere be- 
tween 1 and 20 ms for most purposes (Butts et al. 2007). 
For time bins on this scale, neural spike trains are ef- 
fectively binarized, and the simplest binary model (in 
the maximum entropy sense) that captures pairwise cor- 
relations is the Ising model. The Ising model is thus 
a natural way to describe the statistics of neural spike 
patterns at the timescale of interest. Fitting of such a 
model to the observed neural data has the advantage that 
it does not implicitly assume some non-measured struc- 
ture in the data, i.e. maximum entropy models express the 
most uncertainty about the modelled data given the (ex- 
plicit) chosen constraints (e.g. that certain moments of the 
measured distribution agree with the model distribution) 
(Jaynes 1957). It can be shown that this is mathematically 
equivalent to maximizing the likelihood of the model pa- 
rameters to explain the observed data (Berger et al. 1996). 
By using this approach to fit a model to the conditional ac- 
tivity pattern distribution, in conjunction with maximum a 
posteriori decoding (Foldiak 1993, Oram et al. 1998), it is 



possible to train a decoder which takes as its input a pat- 
tern of spiking activity, and gives as its output the stimulus 
that it determines to have elicited that spike pattern. 

A major obstacle to the use of Ising models for neu- 
ral decoding, is that, in general, it is necessary to com- 
pute a partition function (or normalization factor), involv- 
ing a sum over all possible states. This can be numeri- 
cally challenging, and for large numbers of neurons, un- 
feasible. In the present study, we adopted several ap- 
proaches for circumventing this problem. Firstly, we 
make use of mean field approximations, including both 
the 'naive' mean field approximation and the Thouless 
et al. (1977) (TAP) extension to it, following Roudi, Tyr- 
cha & Hertz (2009). Secondly, we compare this with 
the recently proposed Minimum Probability Flow Method 
(Sohl-Dickstein et al. 2009) for learning model param- 
eters. To assess the relative performance of these ap- 
proaches in the context of a discrete decoding problem, 
we simulated the activity of a population of neurons in 
layer V of the mouse visual cortex during an experiment 
in which a discrete set of orientation stimuli were pre- 
sented. Using this simulation, we evaluated the relative 
performance characteristics of the different decoding al- 
gorithms in the face of limited data, exploring decoding 
regimes with up to 1000 neurons. We demonstrate, for 
the first time, the use of the Ising model to effectively de- 
code discrete stimulus states from large-scale simulated 
neural population activity. 

2 Methods 

2.1 Ising models of neural spike trains 

Activity states in an Ising model are Boltzmann dis- 
tributed, i.e. they are distributed according to the negative 
exponential of the "energy" associated with each state. 
This distribution, 

P oc e~'^f , is the maximum entropy distribution 
subject to the set of constraints imposed by Lagrange mul- 
tiplers A M on variables / M . Imposing these constraints 
upon firing rates and pairwise correlations gives 

Pisin g (r|s) = exp hi{s)ri + \^ Jij{ s ) r i r j j » 

(1) 
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where r = (n, ■ ■ ■ , rc) T and each binary response 2.2 Neural Decoding 

variable ?*j G {0, 1} indicates the firing/not firing of neu- 
ron i in the observed time interval. The parameters hi 
(known in statistical physics as 'external fields') and 
('pairwise couplings') have to be fit to the data such that 
the model displays the same means and pairwise correla- 
tions as the data: 



(Rising - (^Data ' ( 2a ) 
( r * r Asmg = (^-) Data , (2b) 

where (-) m odei denotes expectation with respect to the 
specified distribution. Z{s) is the partition function, 
which acts as a normalisation factor, i.e.: 



Z{s) = CX P \J2 hi ^ n + 



Jij (s)rirj 



(3) 

Note that the first sum is over all possible (as opposed to 
observed) responses, given by the set 1Z. 

In statistical physics it is more common to use a sym- 
metric representation <Tj € { — 1, 1} for the 'spins' that 
describe the activation of neuron i (with —1 indicating 
'no spike' and 1 indicating 'spike'), which simply corre- 
sponds to a change of variables oi = 2r^ — 1. Accord- 
ingly the fields, couplings and partition functions change. 
As it is occasionally more convenient to work in one or 
the other representation we will denote the fields and cou- 
plings in the spin representation with hi and Jij. 

Standard Monte Carlo techniques for fitting these 
model parameters, such as Boltzmann learning, which 
can in principle provide an exact solution - given the 
number of samples is high enough - become computa- 
tionally very expensive if not intractable as the num- 
ber of cells increases. We have found in previous work 
(Seiler et al. 2009), that the Boltzmann learning approach 
becomes computationally too expensive in our case for 
ensemble sizes larger than 30 cells. This poor scaling 
behaviour is mainly due to the exponentially increasing 
number of states with the number of cells. Speeding up 
the model fitting process is hence an essential requirement 
to utilize Ising models for studies with large ensembles 
of neurons. Solutions to speed up the "classical" Boltz- 
mann learning approach have been suggested (Broderick 
et al. 2007). However these are still associated with a high 
computational cost. 



In this paper we consider the problem of decoding which 
of a number of different stimuli has elicited a neural spike 
pattern. This can be seen as a discrete classification task: 
we have a set of S stimuli s E S = {s 1; s 2 , ■ ■ ■ , ss}. De- 
coding in this scenario means that we have to provide a 
decision rule that estimates which stimulus has been the 
input to the system, given an observed spike pattern r b S . 
The particular example to which we apply this is a simula- 
tion of the spike pattern responses elicited by visual stim- 
uli across the receptive field of a visual cortical neuron: 
in this case each stimulus Si represents a different orien- 
tation of a sinusoidal grating. Our main aim with this sim- 
ulation was to validate our methodology in a neurophys- 
iologically realistic coding regime, relevant to datasets to 
which our methodology might be applied. As a supple- 
mentary goal, we hoped to gain some insight into whether 
some aspects of the model affect decoding performance 
- such as heterogeneity of tuning, observed in real neural 
recordings but often ignored in population coding models. 

For decoding, we use the maximum a posteriori (MAP) 
rule: 

s = argmaxp(s|r obs ) (4) 

s 

p(r obs \s)p(s) 

= arg max -. r (5) 

p(r bs) 

= argmaxp(r obs |s)p(s), (6) 

s 

where the second step is the application of Bayes' theo- 
rem and the third equality holds because p(r obs ) is inde- 
pendent of s and is hence irrelevant for maximising the 
given expression, i.e. just a constant factor with respect 
to s that scales the maximum accordingly. In the case we 
examine here we assume we are in control of the stimulus 
distribution p(s), and thus we can choose it to be uniform, 
i.e. to exhibit the same constant probability for each stim- 
ulus and therefore be independent of s, as well. Hence 
our decoding rule simplifies further to the maximum like- 
lihood (ML) rule: 



s = argmaxp(r obs |s). 



(7) 



Within this setting, the task of creating a neural decoder 
reduces to the modelling of the stimulus dependent dis- 
tributions p(r\s). Once these are obtained, we can apply 
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our ML decoding rule (Equation |7]i to estimate the given 
input stimulus s. 

We have used two different statistical models to fit the 
observed spike patterns for each stimulus. Firstly, we have 
used an Ising model for p(r\s), i.e. we assume that for 
each stimulus, the spike pattern distribution can be de- 
scribed by a (different) Ising model. Secondly, we have 
used an independent model distribution pi nc j, assuming 
that given a stimulus, each cell is independent of the oth- 
ers: 



Pind(r|s) = Y[p(ri 



(8) 



i=X 



The independent model is the binary maximum entropy 
model of first order, i.e. it takes into account only the 
first order moments (the constraints on the means given 



by Equation 2a i and is therefore a natural comparison for 
the Ising model. As it is very easy to fit the independent 
model, we used this as a control method, to test whether 
the more complex Ising model could enhance decoding 
performance. Note that the numerical values for the prob- 
abilities can get very small for large cell ensembles, and 
therefore to evade finite precision problems we use in this 
case an equivalent log-likelihood decoding rule instead of 
the ML rule, i.e. maximise the logarithm of the likelihood 
instead of maximising the likelihood directly. 

2.3 Training and Testing 

To train and test the decoders, we proceed as follows: 

1. For each stimulus we simulate a set of possible re- 
sponse vectors. The details of the simulation are de- 
scribed in the following subsection. 

2. We separate the simulated response patterns into 
training data, which is used to fit the model and test 
data which we use to evaluate the decoding perfor- 
mance of the obtained models. 

3. The whole testing procedure is performed with 10 
fold cross-validation, i.e. we divide the whole data 
for each stimulus into 10 equally sized parts. We 
then use 9 parts of the data to train our model and 
the remaining one for testing. We repeat this pro- 
cess again with all 10 possible test/training data set 
combinations of this kind to reveal if our results gen- 
eralize to the whole dataset. 



2.4 Simulation of Evoked Spike Patterns 
from Mouse Primary Visual Cortex 

We simulated the transient response patterns of activity 
evoked by visual (orientation) stimuli in layer V pyrami- 
dal neurons of the anaesthetized mouse visual cortex. The 
orientation direction were chosen to be n ■ 180/5, where 
S is the number of stimuli and n€{0, 1,..,,5-1}. The 
properties of our simulation are motivated by the results 
reported by Niell & Stryker (2008). We simulated dif- 
ferent models by augmenting a basic model with mostly 
homogeneous response characteristics, with some come 
controlled heterogeneous characteristics. 

Our model is defined as follows, with parameters spec- 
ified in Table [T] The spontaneous activity of each neu- 
ron was set to 1.7 spikes per second, corresponds to the 
reported median value for layer V neurons in (Niell & 
Stryker 2008). We assumed that neuronal direction pref- 
erences were uniformly spaced around the circle. Each 
neuron's tuning curve was defined by a von Mises func- 
tion (circular Gaussian) with half width at half maximum 
(HWHM) fit to experimental data (Niell & Stryker 2008). 
The direction selectivity index (DSI) was set to 0. 1 for all 
layer 5 neurons in our model. Sustained firing rates were 
fit to the distributions reported in (Niell & Stryker 2008). 
To reflect that we are considering a situation in which a 
stimulus is decoded from a short time window (20 ms) of 
data, we multiplied these evoked rates by a fixed transient- 
to-sustained ratio of 1.5, taken to reflect the onset re- 
sponse of the neuron's response to a flashed stimulus. As 
our model is fit to data from directional (drifting grat- 
ing) stimuli, we took the arithmetic mean value of the two 
corresponding diametrically opposite directions for each 
neuron to compute the model response to a flashed orien- 
tation. 

The characteristics of the basic model were modulated 
via two inhomogeneity parameters 7, £ e [0, 1], to in- 
troduce a heterogeneous distribution of firing rates and 
the tuning widths respectively, as described in Table [T] 
We neglected inhomogeneity in other parameters. Thus 
the parameter 7 regulates firing rate heterogeneity, where 
7 = corresponded to the basic homogeneous mode,l 
and 7 = 1 to the most heterogeneous firing rate setting. 
The parameter £ was used analogously to regulate hetero- 
geneity in the tuning widths of the neurons. The effects 
of the two heterogeneity parameters 7, £ are illustrated in 
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Parameter 



Value 



Comments 



Preferred tuning direction uniformly spaced 0° — 360° 



Tuning width (HWHM) 

Direction Selectivity Index 
Spontaneous firing rate 

Sustained evoked firing rate 

Transient-to-sustained ratio 



38 + 
0.1 

1.7 + 7Xi(Hz) 
7 + 7 Y* (Hz) 
1.5 



£ G [0, 1]; #j normally distributed, fitted to Fig. 4f of Niell & 

Stryker (2008), truncated minimum of 10. 

fixed for all neurons; Niell & Stryker (2008) Fig. 5c. 

7 € [0, 1]; Xi normally distributed about zero, distribution fitted to 

Fig. 8d of Niell & Stryker (2008); truncated to minimum of 0.3 

7 € [0, 1]; Yi normally distributed, fitted to data in Fig. 8 and supp. 

fig. S3 of Niell & Stryker (2008); truncated to minimum of zero 

fixed for all neurons 



Table 1 : Parameters of the mouse visual cortical model simulated. 7 and e determine the extent of heterogeneity in 
the model, with 7, e = setting the model properties to homogeneous. 



Figure [T] 

Patterns of spikes fired by the neural population 
were simulated using a dichotomized Gaussian approach 
(Macke et al. 2009). Since we cannot estimate covariance 
matrices from experimental data directly, and not every 
positive definite symmetric matrix can be used as the co- 
variance matrix of a multivariate binary distribution, we 
adapted the following approach. First we compute upper 
and lower covariance bounds for each pair of neurons, ac- 
cording to (Macke et al. 2009) 

max{-p<7, -(1 ~p)(l - q)} < Cav(ri,rj) 

<min{(l-q)p,(l-p)q}, (9) 

where p and q are the means (mean spiking probabilities) 
of neuron i and j, respectively. We then choose a ran- 
dom symmetric matrix A that lies between these bounds. 
As in general this choice does not result in a permissible 
correlation matrix for the underlying Gaussian, a Higham 
(2002) correction is applied to find the closest correlation 
matrix possible for the latent Gaussian (cf. Macke et al. 
(2009)), to which we finally arithmetically add a random 
correlation matrix with uniformly distributed eigenvalues 
to adjust the mean correlation strength. Having estab- 
lished a dichotomized Gaussian model we can thus draw 
samples with high efficiency. 

Where not otherwise stated in the text, 10000 trials per 
stimulus were simulated, allowing 9000 training samples 
and 1000 test samples with 10-fold crossvalidation. In the 
absence of a detailed characterization of the correlation 
structure of neural responses in the mouse visual cortex, 



we assumed that the correlation in firing between each 
pair of neurons was weak and positive. Our simulation 
results in a mean correlation of 0. 1 1 and a standard devi- 
ation of 0.040 (measured with 100000 samples for differ- 
ent ensemble sizes) for our basic model and similar levels 
of correlation for nonzero 7, £. Due to limitations of the 
dichotomized Gaussian simulation, we were not able to 
specify the correlations of the spike trains/between the in- 
dividual neurons exactly, thus all reported correlations are 
measured and may be prone to small variations. However, 
such limitations would be inherent to any simulation ap- 
proach, as i) the covariance structure of a multivariate bi- 
nary distribution is always constrained by the firing prob- 
abilities of the cells and can not be chosen independently 
of these firing rates (Macke et al. 2009), and ii) finite sam- 
pling effects will always affect the simulated data, result- 
ing in fluctuations in the correlation structure. 

We were able to vary the (measured) mean absolute 
correlation level in some simulations, allowing an assess- 
ment of the relative effects of correlation strength for de- 
coding. To do this we proceeded as follows: having es- 
tablished the correlation matrix of the latent Gaussian as 
described before allows us to sample in a regime with cor- 
relations around 0.1. Likewise a latent Gaussian with a 
correlation matrix given by the identity matrix would cor- 
respond to the case where there are no correlations in the 
latent Gaussian and thus in the simulated spike trains it- 
self, which means that the simulated spike patterns for 
each neuron are independent (note that due to the model 
used in Macke et al. (2009) the correlation matrix and the 
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covariance matrix of the latent Gaussian are the same). Fi- 
nite sampling effects might still introduce some nonzero 
measured correlations in the spike patterns, but the un- 
derlying distribution would still be with independent neu- 
rons. Thus, by interpolating between these two cases we 
can reduce the effective correlations in the data in a con- 
trolled way, where at one end we yield our original model 
and at the other end we yield a set of independent neurons. 

2.5 Fitting the Ising Model Parameters 

For fitting the model parameters in the Ising model case 
we use two different strategies: mean field approxima- 
tions and minimum probability flow learning. In earlier 
work we used Boltzmann learning (Seiler et al. 2009), 
however as this becomes rapidly computationally in- 
tractable with an increasing number of neurons, we have 
not reported it here. 

2.5.1 Mean Field Methods 

The suitability of different Mean Field approaches for fit- 
ting the parameters of an Ising model have been recently 
assessed by Roudi, Tyrcha & Hertz (2009). In their work, 
Roudi et al. compared the learned model parameter as 
inferred by Boltzmann learning, with the parameters in- 
ferred by a number of different approximative algorithms. 
Here we examine the utility for decoding of using succes- 
sively higher order mean field approximations. Tanaka 
(1998) demonstrated how to systematically obtain mean 
field approximations of increasing order based on a Ple- 
fka series expansion (Plefka 2006) of the Gibbs free en- 
ergy. By truncating these series to terms up to n-th or- 
der, and using the linear response correction, it is possible 
to derive an n-th order mean field approximation, yield- 
ing C equations for the external fields, and C(C — l)/2 
equations for the pairwise couplings (cf. Tanaka (1998)). 
These equations can then be solved with respect to 
and hi. For higher order approximations, these equations 
can have more than one solution. This problem can be re- 
solved by considering that the Plefka series expansion is 
effectively a Taylor series expansion and continuity of the 
solutions is expected when higher order terms are gradu- 
ally increased (Tanaka 1998). 

Tanaka further provided an explanation for the "diago- 
nal weight trick" as used by Kappen & Rodrguez (1998). 



With this trick one introduces C extra equations for the 
pairwise "self-couplings" Ja, which can be used to refine 
the respective approximations. The success of this trick 
can be explained (Tanaka 1998) by considering that us- 
ing this diagonal weight trick in an n-th order method, is 
effectively incorporating the dominant terms of the next 
higher (n + 1) order expansion. 

For fitting the parameters, we have compared different 
mean field approximations from zeroth to second order 
with and without the diagonal weight trick, thus incor- 
porating with our highest approximations all second or- 
der terms and the dominant third order terms of the free 
energy expansion. The zeroth order method is thereby 
equivalent to the independent model, thus providing an- 
other way of thinking about the independent decoder. 

First order methods (naive mean field approximation). 

For the first order or naive mean field approximation, the 
equations for the external fields and pairwise couplings 
become: 

J = -CT 1 (i ^ j), (10a) 

hi = tanh -1 rrii — Jijirtj, (10b) 

where J is the estimated coupling matrix with elements 
Jij, the 'magnetization' m; = (<7j), and the covariance 
matrix C is defined by CV, = {ci<Jj) — mirrij. In the 
following we will denote this naive mean field method by 
nMF 

By incorporating the diagonal weight trick the above 
equations change slightly into the following: 

j = P 1 C \ (11a) 

hi = tanh -1 mi — ^ JijiJij, (lib) 

3 

where in addition to the matrices defined above, we have 
defined the diagonal matrix Pij = (1 — mf )Sij, with dij 
being the Kronecker delta. Note that in the nomencla- 
ture of Roudi, Tyrcha & Hertz (2009), this has been called 
"naive mean field method". However we will in the fol- 
lowing refer to it as naive mean field method with diago- 
nal weight trick (nMFwd). 
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Second order methods (TAP approximation). The 

equations of the second order method are also known as 
the TAP equations (Thouless et al. 1977) and can be seen 
as a correction to the naive mean field methods. Using 
the TAP approach the equations for the model parameters 
read: 

2J i |m;m j + Jij + (CT%- =0 (t ^ j), (12a) 
hi = tanh -1 m; — ^ JijTUj + m, ^ J^(l — m|), 

(12b) 

where the first equation can be solved for the pairwise 
couplings Jij. As mentioned previously the correct so- 
lution has to be chosen according to continuity condi- 
tions outlined in Tanaka (1998), from which then the 
external fields hi can be computed. More precisely, if 
mjmj(C _1 )jj < we choose the solution, which is 
closer to the original first order mean field solution. If 
rrnmj (C -1 )^ > we use the first order mean field solu- 
tion directly. We use this procedure as it avoids pairwise 
couplings becoming complex, and respects the continu- 
ity of the inverse Ising problem for as a function of 
(C _1 )ij. In the following this method is denoted by TAP. 

Incorporating third order terms. The second order 
method can also be augmented by the diagonal weight 
trick, hence incorporating the leading third order terms 
of the free energy expansion. The equations for the TAP 
method with diagonal weight trick (TAPwd) are given by: 

2J ij 2 mim j + Jij + (CT%- =0 (i ^ j), (13a) 

Ju = - (C-% (i = j), (13b) 

1 -mf 

hi = tanh -1 TOj — ^ JijiUj, (13c) 
i 

where we have solved the equations in an analogous fash- 
ion to that described for the normal TAP approach. 

2.5.2 Minimum Probability Flow Learning 

Sohl-Dickstein et al. (2009) recently proposed the Min- 
imum Probability Flow Learning (MPFL) technique, 
which provides a general framework for learning model 



parameters. As this technique is also applicable to the 
Ising model, we have used it to learn external fields and 
pairwise couplings for our model. However, as the sam- 
pling regime usually feasible in neurophysiological exper- 
iments dictates a small number of samples compared to 
the number of parameters in the model (which is 0(C 2 ) 
with C cells), the learning problem for the parameters be- 
comes under-constrained already at intermediate neural 
ensemble sizes, i.e. we are likely to have more param- 
eters to fit than there are samples. 

We therefore introduced a regularization term to their 
original objective function to penalize model parameters 
growing to large numbers, i.e. to avoid overfitting. Given 
the original objective function K(9) with 9 being the pa- 
rameters of our model, our regularized objective function 
reads: 

K ieg (e) = K(e) + ^\\e\\l (14) 

where 1 1 - 1 1 2 is the Li norm, which is a common choice of 
regularization term (Bishop 2007). So for the Ising model 
case we have: 

i &j 

For the present work, the regularization parameter was 
set to A = 0.0127, after systematically assessing differ- 
ent settings for an ensemble size of 100 cells via cross- 
validation. We refer to this learning algorithm as rMPFL 
(regularized MPFL) in the rest of the manuscript. 

Other choices for the regularization term are possible 
and might even result in better performance for decoding 
purposes, e.g. two independent penalty terms for the ex- 
ternal fields and the pairwise couplings. However an ex- 
tensive assessment of different parameter settings would 
be very time consuming due to the cost of calculating 
the invoked partition function. We therefore have not 
performed an exhaustive analysis of regularization. In a 
real experiment the regularization term should however be 
adapted to yield the best possible performance for the spe- 
cific number of cells. We not that while it is not necessary 
to compute the partition function for some applications of 
MPFL (e.g. if learning the Jij parameters is the end in 
itself), it is required for decoding. 
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2.6 Partition Function Estimation 



with 



Estimating the partition function is a computationally ex- 
pensive task, since the set of possible responses 1Z grows 
exponentially with the number of cells C, rendering an 
analytical computation (Equation [3]) intractable for large 
neural ensembles. 

As MPFL learning does not provide an estimate of the 
partition function, we used the Ogata-Tanemura partition 
function estimator (Ogata & Tanemura 1984, Huang & 
Ogata 2001), which is based on Markov Chain Monte 
Carlo (MCMC) techniques. However MCMC is still a 
very time consuming technique. To speed up the model 
fitting process, we estimated the partition function for 
each stimulus only once in a 10 fold cross-validation run 
when using MPFL, as ideally all samples for a specific 
stimulus should come from the same distribution, thus ap- 
proximately sharing the same partition function. We have 
examined the effect of this approximation (see Results). 

When fitting the model parameters with mean field the- 
oretic approaches, we computed the (true) partition func- 
tion Z(s) in the mean field approximation, as reported in 
Kappen & Rodrguez (1998), Thouless et al. (1977), and 
Tanaka(1998). 

For the first order mean field approach this yields: 

log Z = 1o S ( 2 cosh ft + W ij) - 

i i 
i<j 

(15) 

with 

Here each of the parameters is actually a function of the 
stimulus s, which we omit for clarity. 

For the second order methods, the corresponding equa- 
tion becomes: 

log Z = X log (2 cosh (hi + LA ) - L i m i 

i i 
i<j i<j 

(16) 



Li = Jijinj - m l ^ Jj(l - mj). 

2.7 Performance evaluation 

The fraction of correctly decoded trials was the principal 
method used to assess decoding performance. However, 
as the fraction correct or accuracy does not by itself pro- 
vide a complete description of decoder performance, we 
sought used additional performance measures. The per- 
formance of a decoder is fully described by its confusion 
matrix, and we show how directly examining this matrix 
can yield insight into its behaviour. However, it is advan- 
tageous to be able to reduce this to a single number in 
many cirumstances. We therefore additionally computed 
the mutual information between the encoded and decoded 
stimulus (Panzeri, Treves, Schultz & Rolls 1999) to char- 
acterise the performance further. This provides a compact 
summary of the information content of the decoding con- 
fusion matrix. 

We can write the mutual information (measured in bits) 

as: 

I(s,s) = H(s)-H(s\s), (17) 
where H(s) is the entropy of the encoded stimulus 

= lQg 2j >(«), (18) 

and H(s\s) is the conditional entropy describing the dis- 
tribution of stimuli s that have been observed to elicit each 
decoded state s, 

H(s\s) = - p(s,s)log 2 p(s\s). (19) 

Since we have in the current study opted for a uniform 
stimulus distribution, the entropy H(s) is simply given 
by 

H(s) = log 2 S. (20) 

In general the conditional entropy H(s\s) has to be com- 
puted from the confusion matrix. We note that if we were 
to assume that the correctly decoded stimuli and errors 
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are uniformly distributed for all stimuli, i.e. that the con- 
ditional distribution p(s\s) is of the form 

!f c for s = s 

l-/c , . , 
— far.* a, 

then the conditional entropy simplifies to 

H(s\s) = -f c log 2 f c - (1 - f c ) log 2 (j^A ■ 

This simplified expression has been used to characterize 
decoder performance in the Brain Computer Interface lit- 
erature (Wolpaw et al. 2002). Here, we present this equa- 
tion only to make apparent the scaling behaviour, and 
compute the decoded information using the more general 
expression (Eqn. \T7\. 

3 Results 

We performed computer simulations as described in 
Methods, to generate datasets for training and testing de- 
coding algorithms. For all settings, 20 simulations were 
performed with different random number seeds, in order 
to characterize decoding performance. A number of met- 
rics, including the fraction of correct decodings (accu- 
racy) and mutual information between decoded and pre- 
sented stimulus distributions, were used in order to char- 
acterize and compare decoding performance. 

Basic Model 

The Ising model based decoders show better performance 
than the independent decoder in nearly all cases (Fig. [3^), 
in terms of the average fraction of correctly decoded stim- 
uli. The performance of the standard (non-regularized) 
MPFL technique, however, in our hands falls away rel- 
atively quickly as the number of cells increases, fail- 
ing to better the independent decoder after approximately 
100 cells. This behaviour can be explained by consid- 
ering that the problem of parameter estimation becomes 
more and more underconstrained as the number of cells 
increases while holding the number of training samples 
fixed. Falsely learned model parameters moreover affect 
the decoding performance by influencing the estimated 



partition function and thus worsen the decoder perfor- 
mance. As we have only estimated the partition function 
once per stimulus when using MPFL, large fluctuations 
in the training dataset can potentially have a big effect. 
To compensate for this behaviour, a regularization term 
can be included, which can stabilize the performance up 
to a significantly larger number of neurons (as described 
in Methods). Our regularized version of MPFL still de- 
creases in performance after about 110 cells. However, 
we have used a fixed value for the regularization param- 
eter for all simulations here, whereas ideally the regular- 
ization term should be adapted to the number of cells and 
number of samples in the specific setting. Using such an 
approach would most likely result in a better performance 
for larger numbers of cells. As an example we have tested 
for an optimized A parameter for 150 cells and found that 
we could increase performance to 0.874 in this case (av- 
erage over 10 trials, not shown in figure). 

One of the assumptions made for much of this pa- 
per is that the partition function can be estimated only 
once in each 10-fold cross validation run, thus speeding 
up training. We studied the effect of this approximation 
on performance, finding, as shown in Figure [4] that the 
effect is marginal, at least under our operating conditions. 
No statistically significant difference between the two ap- 
proaches was observed. In a decoding regime where the 
rMPFL method starts to infer the wrong model parameters 
(e.g. due to overfitting) one would expect these effects to 
become more pronounced. However, in such a scenario 
where the rMPFL method becomes unreliable, a different 
regularization term or a different inference method should 
be considered. 

The decoded information analysis reveals that the dif- 
ference in the decoding performance of the independent 
and Ising (as exemplified by TAP, TAPwd and rMPFL) 
models becomes more pronounced as the number of stim- 
uli is increased, as shown in Fig. [3j>,c. As the number 
of stimuli increase, the independent and Ising decoder 
curves separate, indicating not only a difference in the ac- 
curacy of both decoders, but also a difference in the con- 
fusion matrices, i.e. in the distribution of errors between 
the two approaches. This is considered in more detail in 
section [3~2l 

A further interesting behavior of the Ising decoder is 
apparent in Fig. [3};: as the number of stimuli increases, 
the relative decoding gain r\, (which we define as the ra- 
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Figure 4: Decoding performance effects of reduced par- 
tition function estimation for rMPFL method as Box- 
Whisker plot. Interquartile range is indicated by blue box, 
the median value by a red line. Whiskers are used to visu- 
alize the spread of the remaining data for values at most 
within 1.5 times the interquartile range from the ends of 
the box. Left: results for computing the partitioning func- 
tion for each stimulus every round for 10-fold cross val- 
idation. Right: results for computing the partition func- 
tion only in the first round for each stimulus. Data from 
20 simulation runs, 70 cells, basic model, 10000 samples 
simulated per stimulus 

tio between the "actual" and "chance" fraction correct) 
keeps increasing with the number of stimuli for the Ising 
model case, whereas it saturates for the independent de- 
coder. This effect can be explained as follows: the inde- 
pendent decoder relies on the fact that two different stim- 
uli will result in different firing rates for each cell. With an 
increasing number of stimuli however, the difference be- 
tween two adjacent stimuli becomes smaller and smaller, 
and thus the difference in the firing rates discriminating 
two adjacent stimuli becomes smaller and smaller. There- 
fore the decoder performance of the independent decoder 
rapidly decreases as the number of stimuli increases. As 
two adjacent stimuli can result in neural responses that 
have quite different correlation structure despite having 
very close firing rates, the Ising decoder can additionally 
make use of this information in the data and thus yield 
better performance. This suggests that the Ising decoder 



may be particularly advantageous as the decoding prob- 
lem becomes more difficult and not easily discriminable 
in terms of the neural firing rates. Real world performance 
will of course be dependent upon the level of systematic 
difference in correlation structure induced by stimuli in 
any given dataset. 

Performance of the Ising decoder is strongly dependent 
on the number of training trials available (Fig. [3}l i). 
Here we found, for 70 neurons, that around 400 train- 
ing samples were required to allow the Ising decoder to 
outperform independent decoding. The independent de- 
coder will necessarily have better sampling performance, 
as it relies only upon lower order response statistics. (It is 
worth recalling that a "full" decoder, which made use of 
all aspects of spike pattern structure, would have far worse 
scaling behaviour than either). Another way of looking at 
this is to examine the estimated covariance matrix from 
the simulated data (Fig. [3ji ii). We define the mean rela- 
tive error of the covariance matrix as 



E = 



-y-y 

s£5 l,J 



c: 



data 



C 



C; 



IJ.S 



(21) 



where Cffl is the estimated covariance between unit i and 
j under stimulus s from the (finite) simulated data (nor- 
mally 10000 samples), and Of* B is the asymptotic covari- 
ance, defined in the same way but computed with 100,000 
samples. This error provides a measure of the finite sam- 
pling bias we encounter for fitting the model. 



Heterogeneous Models 

The performance characteristics in a heterogeneous sce- 
nario, i.e. with nonzero £, 7, are for the most part broadly 
similar to the homogeneous case, so we report here only 
on the observed differences. The overall classification 
performance both in terms of fraction correct and mutual 
information, is slightly improved for both independent 
and Ising decoders with heterogeneous neural ensembles. 
As an example, the accuracy as a function of ensemble 
size is compared for the basic model and the "fully hetero- 
geneous" (7 = £ = 1) models in Fig|5J for both TAPwd 
and Independent decoders. The greater performance for 
nonzero 7, £ can be explained by the greater variability 
of cell properties, allowing more specific response pat- 
terns than in the homogeneous scenario. Such a scenario 
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Figure 5: Decoder performance is enhanced in the hetero- 
geneous scenario (£ = 7 = 1). Comparison of Fraction 
correct vs. number of cells for basic model and fully het- 
erogeneous model. The TAPwd algorithm was used to 
train the decoder. 

is presumably relevant to many real-world decoding prob- 
lems, suggesting that decoder performance analysed with 
homogeneous test data may slightly under-represent real- 
world performance. 

We also assessed the relative influence of the individ- 
ual parameters 7, £ on the decoding behavior as shown in 
Figure [6] Our analysis shows that, while the heterogene- 
ity in the firing rates as specified by 7 has a positive effect 
on the decoding performance, an increased tuning width 
heterogeneity slightly decreases the performance of the 
decoder. However, the relative influence of the parameter 
£ is small. 

3.1 Dependence on level of correlation 

As the advantage of the Ising Decoder over the Inde- 
pendent Decoder stems from its ability to take advantage 
of information contained in pairwise correlations, we ex- 
amined the dependence of this advantage on the average 
strength of correlation. Although we have set the average 
level of correlation to what has traditionally been thought 
to be a reasonable level for cells in the same neighbour- 
hood (Zohary & Shadlen 1994), there is an ongoing de- 
bate about the level and stimulus dependence of correla- 
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Figure 6: Decoder performance dependence on parame- 
ters £, 7. With increasing 7 the decoding performance is 
enhanced. The tuning width heterogeneity as specified by 
£ reduces the performance of the decoder (values from up- 
per left to lower right curve £ = {0, 0.2, 0.4, 0.6, 0.8, 1}). 
Data for 70 cells, TAPwd algorithm used to train the de- 
coder. 

tion relevant for cortical function (Renart et al. 2010, Bair 
et al. 2001, Nase et al. 2003, Ecker et al. 2010). This 
is of course critical for the performance of the Ising de- 
coder. If there were no (noise) correlations present in the 
data, i.e. an independent decoder were the correct model 
for the data, there would be no benefit to using any de- 
coder including correlation such as the Ising decoder. In 
fact, any decoder including correlations would most likely 
perform worse in practice than an independent decoder, 
as due to finite sampling effects one would most likely 
falsely estimate some small correlation in the data. Over- 
fitting to these falsely learned correlations would then re- 
sult in a performance decrease compared to an indepen- 
dent decoder, which by construction does not include any 
correlations, i.e. would implement the correct model. 

This effect be seen in Figure [7] where we have varied 
the mean absolute correlation strength as outlined in the 
methods section. It can be seen that as the level of corre- 
lation increases for specified spike count, the independent 
decoder loses some discriminative capacity. The Ising de- 
coders, however, take advantage of the higher level and 
spread of correlation values for discrimination between 
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Figure 7: Ising decoder benefits from a correlation 
regime. Upper panel: Fraction correct as a function of 
the measured average absolute Pearson correlation coeffi- 
cient for four decoding algorithms. Lower panel: Average 
Standard Deviation of the Pearson Correlation coefficient 
vs. average Pearson correlation coefficient. All averages 
taken over 20 simulations; 70 cells; basic model; correla- 
tions measured with 10000 samples. 



stimuli. As described in Methods, at the lower end of this 
curve the underlying latent Gaussian has an identity corre- 
lation matrix, and thus the individual neurons are actually 
independent, although the average measured absolute cor- 
relation does not completely decrease to zero. This pro- 
vides an explanation of why the Ising decoder fails to beat 
the independent decoder: by assuming the measured cor- 
relations are due to the true distribution, it overfits to these 
correlations, and thus performs worse compared to the 
(by construction correct) independent model. It should be 
noted, however, that the performance drop of the Ising de- 
coder compared to the independent decoder is small even 
for a regime where cells are effectively independent. 

To mimic the richer correlation structures potentially 
found in real neural recordings, we performed further test- 
ing. We simulated larger ensembles of neurons, while 
keeping the number of observed cells fixed, thus effec- 
tively creating "hidden" neurons. The visible neurons 
were simulated as described in methods, while for every 
unobserved cell, the preferred tuning direction was cho- 



Figure 8: Decoder robustness to correlations due to hid- 
den units. Fraction Correct fc for 100 cells, normalized to 
the fraction correct fco where 100 out of 100 cells are vis- 
ible, is displayed as a function of the ratio between Chidden 
hidden cells and C v isibie observed cells. For simulations 
the basic model was used. 

sen randomly according to a uniform distribution. By us- 
ing this scheme we could assure that the observed C v i s it,i e 
cells always had the same characteristics except the cor- 
relation structure, which was effectively changed by the 
introduction of Chidden unobserved neurons. It should be 
noted that introducing hidden cells alters not only the sec- 
ond order statistics but also higher order correlations in 
the data, thus providing a much richer statistical structure 
in the data. 

The different decoding (and training) methods vary in 
their robustness towards the introduction of such higher 
order correlation structure (Fig. [8}. While the indepen- 
dent decoder is relatively unaffected by the introduction 
of additional unobserved cells, the Ising decoder model 
is more sensitive to such changes. However, significant 
differences between the different training strategies can 
be observed: in our case the rMPFL method showed the 
least performance drop (about 10%) in a scenario where 
approximately 100 out of 500 cells were observed, while 
the fraction correct for the standard TAP approach drops 
by more than 15% in this scenario. TAPwd is less af- 
fected than TAP, consistent with its inclusion of the lead- 
ing terms for the next order expansion. 
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3.2 Confusion Matrix Analysis 

The confusion matrix provides complete information 
about the decoding error distribution. Confusion matri- 
ces for the Ising decoder and the Independent decoder 
respectively are shown in Fig. [9] The Ising decoder has 
higher diagonal terms, corresponding to better decoding 
accuracy. The overall appearance of the Ising decoder 
confusion matrix is fairly similar to the independent de- 
coder. However, by comparing the two confusion ma- 
trices it can be seen that the Ising decoder mainly gains 
its performance benefit over the independent decoder by 
avoiding confusion of adjacent stimuli. This shows that 
as the difference in adjacent stimulus directions becomes 
less with an increasing number of stimuli, the Ising model 
decoder can utilize the correlation patterns to enhance the 
decoding accuracy, i.e. to distinguish between adjacent 
stimulus directions more precisely. This effect of course 
depends on the correlation model used - here the correla- 
tion between each pair of neurons was resampled for each 
stimulus in the simulation. If (noise) correlations were 
not at all stimulus-dependent, or if the model was quite 
different, then the Ising decoder may not be able to take 
advantage of this potential performance advantage. 

3.3 Comparison with linear decoding 

While most work in the Brain-Machine Interface litera- 
ture has focused on continuous decoders, there has been 
some work on the discrete decoding problem, although 
to date with a focus on the analysis of either contin- 
uous data, such as electroencephalographic (EEG) data 
(Wolpaw et al. 2002), or on longer time windows of multi- 
electrode array data, in which spike counts are far from 
binary (Santhanam et al. 2009). As the discrete decoding 
problem can be viewed as a classification problem (in the 
same sense as the continuous decoding problem can be 
seen as a regression problem), it is of interest to compare 
the performance of our approach with traditional classi- 
fication approaches such as the Optimal Linear Classifer 
(OLC). 

Following Bishop (2007), each stimulus class can be 
described by its own linear model, so that 



scheme (i.e. we denote stimulus class Si by a "target" col- 
umn vector t with all zeros except the i-th entry, which is 
one) the weights w s can be trained such as to minimize a 
sum of squares error function for the target stimulus vec- 
tor. 

This is done in Fig. 
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where s = 1,...,S. Using a 1-of-S binary coding 



It can be seen that, under the 
conditions we test here (10000 trials, simulated data as 
described previously for the basic model), the OLC un- 
derperforms both the Ising (as exemplified by the TAPwd 
approach) and Independent classifiers. The former is not 
unexpected, but it may seem initially counter-intuitive that 
the OLC does not yield identical performance to the In- 
dependent decoder, as the latter is in effect performing a 
linear classification. 

However, the following differences must be noted for 
these two algorithms. As their implementation details 
differ, they may have markedly different sensitivity to 
limited sampling - the independent decoder, as we have 
constructed it here (product of marginals) has remark- 
able sampling efficiency. Most importantly however, the 
OLC decoder assumes by construction a Gaussian error 
in the stimulus class target vectors, i.e. it corresponds to 
a maximum likelihood decoding when assuming that the 
target vectors follow a Gaussian conditional distribution 
(Bishop 2007). This assumption is clearly not valid in the 
case of binary target vectors. Therefore the failure of the 
OLC decoder should not be surprising. 

4 Discussion 

We have demonstrated, for the first time, the use of the 
Ising model to decode discrete stimulus states from simu- 
lated large-scale neural population activity. To do this, we 
have had to overcome several technical obstacles, namely 
the poor scaling properties of previously used algorithms 
for learning model parameters, and similarly the poor 
scaling behavior of methods for estimating the partition 
function, which although not necessary for some appli- 
cations of the Ising model in neuroscience, is required 
for decoding. The Ising model has one particular advan- 
tage over a simpler independent decoding algorithm: it 
can take advantage of stimulus dependence in the cor- 
relation structure of neuronal responses, where it exists. 
With the aid of a statistical simulation of neuronal ensem- 
ble spiking responses in the mouse visual cortex, we have 



13 




10 

Number of Cells 

Figure 10: Comparison of the TAPwd Ising Decoder and 
Independent Decoder with the Optimal Linear Classifier 
(OLC). 



demonstrated that correlational information can be taken 
advantage of for decoding the activity of neuronal ensem- 
bles of size in the hundreds by several algorithms, includ- 
ing mean field methods from statistical physics and the 
rMPFL algorithm. 

Ising models have gained much attraction recently in 
neuroscience to describe the spike train statistics of neu- 
ral ensembles (Schneidman et al. 2006, Shlens et al. 2006, 
Shlens et al. 2009). However, these findings have largely 
been made only in relatively small neural ensembles (typ- 
ically a few tens of cells), from which an extrapolation to 
larger ensemble sizes might not be wise (Roudi, Niren- 
berg & Latham 2009, Roudi, Aurell & Hertz 2009). The 
principal reason for this limit has been the poor scaling of 
the computational load of fitting the Ising model param- 
eters, when algorithms such as Boltzmann learning are 
used. Moreover, new findings suggests that pairwise cor- 
relations (and thus Ising models) might not be sufficient to 
predict spike patterns of small scale local clusters of neu- 
rons (< 300/xm apart), which have been observed to pro- 
vide evidence of higher order interactions (Ohiorhenuan 
et al. 2010). While the formalism for higher order mod- 
els may be similar, scaling properties are guaranteed to be 
even worse. There is thus the pressing need to develop 
better algorithms for learning the parameters of Ising and 
Ising-like models. It should be noted that while Ising 



models might not necessarily provide an exact descrip- 
tion of neural spike train statistics, for decoding purposes 
we only require that we can approximate their distribution 
well enough to achieve good decoding performance. 

The development of discrete neural population decod- 
ing algorithms has two motivations. The first motivation 
is the desire to develop brain-computer communication 
devices for cognitively intact patients with severe motor 
disabilities (Mak & Wolpaw 2009). In this type of ap- 
plication, an algorithm such as those we describe could 
be used together with multi-electrode brain recordings to 
allow the user to select one of a number of options (for 
instance a letter from a virtual keyboard), or even in the 
longer term to communicate sequences of symbols from 
an optimized code directly into a computer system or 
communications protocol. Given the short timescale to 
which the Ising decoder can be applied (we have fixed 
this at 20 ms here), and sufficiently large recorded ensem- 
bles to saturate decoder performance, very high bit rates 
could potentially be achieved. 

The second motivation is more scientific: to use such 
decoding algorithms to probe the organization and mech- 
anisms of information processing in neural systems. It 
should be immediately be apparent that the Ising decoder 
and related models can be used to ask questions about 
the neural representation of sensory stimuli, motor states 
or other behavioral correlates, by comparing decoding 
performance under different sets of assumptions (for in- 
stance, by changing the constraints in an Ising model to 
exclude correlations, include correlations within 50 fim, 
etc). This (commonly referred to as the "encoding prob- 
lem") is essentially the same use to which Shannon in- 
formation theory has been applied in neuroscience (see 
e.g. Schultz et al. (2009) for a recent example), with 
simply a different summary measure. Use of decoding 
performance may be an intuitively convenient way to ask 
such questions, but it is still asking exactly the same ques- 
tion. However, there are other uses to which such algo- 
rithms can be applied. For instance, combining sensory 
and learning/memory experimental paradigms, once a de- 
coder has been trained, it could be used subsequently to 
read out activity patterns in different brain states such as 
sleep, or following some period of time - for instance, to 
"read out memories" by decoding the patterns of activ- 
ity that represent them. The decoding approach may thus 
have much to offer the study of information processing in 
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neural circuits. 

Our results show that decoding performance is criti- 
cally dependent on the sample size used for training the 
decoder, as relatively precise characterization of pairwise 
correlations is needed to fit a model that matches the sta- 
tistical structure of as-of-yet unobserved data well. In the 
"encoding problem", such finite sampling constraints re- 
sult in a biased estimate of the entropy of the system. For 
the decoding problem considered here, finite sampling 
leads to overfitting of the model to the observed training 
data, so that it does not generalize well to the unobserved 
data, and accordingly fails to predict stimulus classes cor- 
rectly during test trials. Such finite sampling constraints 
mean that below a particular sampling size - which we 
found to be 400 trials for 70 neurons in one particular 
example we studied - there is no point in using a model 
which attempts to fit pairwise (or above) correlations, one 
may as well just use an independent model. This has im- 
plications for experimental design. However, it should be 
noted that the real brain has no such limitation - in effect, 
many thousands or millions of trials are available over de- 
velopment, and so a biological system should certainly be 
capable of learning the correct correlations from the data 
(Bi & Poo 2001) and thus may well be able to operate in a 
regime where decoding benefits substantially from known 
correlation structure. 

We have shown that incorporating correlations in the 
decoding process might be especially relevant for 'hard' 
decoding problems, i.e. multi-class discrimination prob- 
lems in which stimuli are not easily distinguishable by 
just observing individual neuronal firing rates. In this sce- 
nario including correlations could be a means to enhance 
the precision of the decoding process by increasing the 
discriminability between adjacent or similar stimuli. In- 
cluding correlations could make the pattern distribution 
more flat, or uniform, with low firing rates, leading to 
greater energy efficiency of information coding (Baddeley 
et al. 1997). 

We must insert a note of caution concerning the pres- 
ence of higher order correlations in data to be fitted. Such 
higher order correlations might arise simply from the 
presence of a large number of "hidden" neurons whose 
activity has not been recorded, but which have a substan- 
tial influence on the activity of those neurons recorded. 
This is likely to occur frequently in real neurophysiolog- 
ical situations. The performance of pairwise correlation 



decoder models, such as the Ising model, is necessarily 
affected detrimentally by such effects, as shown in Fig. 8. 
Interestingly, an independent decoder is less affected (al- 
though it may also be capturing less information anyway). 
Obviously, it is possible to make use of higher order mod- 
els to alleviate this problem, but a penalty then has to be 
paid in sampling terms. 

We note that the primary use of the Ising model in 
neuroscience so far has been to model the empirical 
statistics of neural spike train ensembles (Schneidman 
et al. 2006, Shlens et al. 2006). There is of course no 
requirement that a good decoder is also a good model of 
neural spike train statistics - what matters is its perfor- 
mance on the test dataset. Nevertheless, knowing how 
well the model captures spike train pattern structure may 
help to build better decoders, and of course may be of par- 
ticular value when those decoders are used to study neu- 
ral information processing (as opposed to being used for 
a practical purpose such as brain-machine interface devel- 
opment). Unfortunately, a direct comparison of empirical 
to model spike pattern probabilities is not experimentally 
feasible for large ensemble sizes - while this is relatively 
simple for 15 cells, it is far from viable for 500 cells. 
Further work is needed to determine how best to evaluate 
the performance of decoders at predicting empirical spike 
patterns, when decoding very large neural ensembles. 

One caveat to the advantage provided by correlations 
of using Ising over Independent decoders is that it de- 
pends entirely upon the extent to which correlations are 
found to depend upon the stimulus variables of interest. 
While a previous study at longer timescales has found cor- 
relations to improve neural decoding (Chen et al. 2006), 
the jury is still out on the prevalence of stimulus de- 
pendence of pairwise and higher order correlations in 
the mammalian cortex. Stimulus-dependent correlations 
have been found in mouse (Nase et al. 2003), cat (Das 
& Gilbert 1999, Samonds & Bonds 2005) and monkey 
(Kohn & Smith 2005, Kohn et al. 2009), where they 
have been shown to contribute to information encoding 
(Montani et al. 2007), but most recordings to date have 
sampled relatively sparsely from the local cortical circuit, 
due to limitations in multi-electrode array hardware. It is 
possible that if one were to be able to record from a greater 
proportion of neurons in the local circuit, then stronger 
stimulus-dependent correlations might be observable. 

In an experimental setting the performance could also 



15 



be enhanced by choosing an optimal timebin-width, a 
question that we have not addressed in this paper. How- 
ever care is needed for choosing the right bin-width. As 
shown by Roudi et al. (Roudi, Aurell & Hertz 2009, 
Roudi, Nirenberg & Latham 2009), a small bin-width is 
likely to yield a good fit, however choosing a too small 
bin-width invalidates the underlying assumption of uncor- 
rected time-bins. Choosing a too large time-bin makes it 
however harder to find a good fit for the model and addi- 
tionally may violate the assumptions of binary responses. 

A number of avenues present themselves for future de- 
velopment of decoding algorithms. Firstly, algorithms 
for reducing model dimensionality without losing dis- 
criminatory power, may prove advantageous. These 
may include graph and hypergraph theoretic techniques 
(Aghagolzadeh et al. 2010) for pruning out uninformative 
dimensions (edges and nodes), and factor analysis meth- 
ods for modeling conditional dependencies (Santhanam 
et al. 2009). Such an approach may be particularly ad- 
vantageous when experimental trials are limited, as the 
dimensionality of the parameter set is the main reason for 
Ising model performance not exceeding the Independent 
model for limited trials. One difficulty with the use of 
graph pruning approaches is that the usual pairwise cor- 
relation matrix of neural recordings, unlike the graph in 
many network analysis problems, tends not to be sparse. 
It is of course a functional, as opposed to synaptic, con- 
nectivity matrix, and one reason for this lack of sparseness 
is its symmetric nature. It has recently been proposed that 
the symmetry property of the J,j matrix can be relaxed in 
the context of the (non-equilibrium) Kinetic Ising model, 
which also provides a convenient way to take into account 
space-time dependencies, or causal relationships (Hertz 
et al. 2010, Roudi & Hertz 201 1). Use of the Kinetic Ising 
model framework for decoding would appear to be an in- 
teresting future direction to pursue. 

New experimental technologies are yielding increas- 
ingly high dimensional multivariate neurophysiological 
datasets, usually without concomitant increases in the du- 
ration of data that can be collected. However, there is 
some reason for optimism that we will be able to develop 
new data analysis methods capable of taking advantage of 
this data. Maximum entropy approaches to the fitting of 
structured parametric models such as the Ising model and 
its extensions would appear to be one approach likely to 
yield progress. 



Acknowledgements 

We thank Phil Bream, Helene Seiler and Yang Zhang 
for their contributions to earlier work leading up to that 
reported here, and Aman Saleem for useful discussions 
and comments on this manuscript. We also thank Yasser 
Roudi for useful comments on the TAP equations, and 
Jascha Sohl-Dickstein, Peter Battaglino, and Michael R 
DeWeese for useful MATLAB code and discussion of the 
MPFL technique. This work was supported by EPSRC 
grant EP/E002331/1 to SRS. 

References 

Aghagolzadeh, M., Eldawlatly, S. & Oweiss, K. (2010), 
'Synergistic Coding by Cortical Neural Ensembles.', 
IEEE Trans Inf Theory 56(2), 875-899. 

Baddeley, R., Abbott, L. E, Booth, M. C, Sengpiel, E, 
Freeman, T, Wakeman, E. A. & Rolls, E. T. (1997), 
'Responses of neurons in primary and inferior tem- 
poral visual cortices to natural scenes.', Proc Biol 
Sci 264(1389), 1775-83. 

Bair, W. & Koch, C. (1996), 'Temporal precision of spike 
trains in extrastriate cortex of the behaving macaque 
monkey.', Neural Comput 8(6), 1 185-202. 

Bair, W., Zohary, E. & Newsome, W. T. (2001), 'Cor- 
related firing in macaque visual area MT: time 
scales and relationship to behavior.', J Neurosci 
21(5), 1676-97. 

Berger, A. L., Pietra, V. J. D. & Pietra, S. A. D. (1996), 
'A maximum entropy approach to natural language 
processing', Comput. Linguist. 22(1), 39-71. 

Bi, G.-q. & Poo, M.-m. (2001), 'Synaptic modification by 
correlated activity: hebb's postulate revisited', An- 
nual Review ofNeuroscience 24(1), 139-166. 

Bishop, C. M. (2007), Pattern Recognition and Machine 
Learning (Information Science and Statistics), 1st 
ed. 2006. corr. 2nd printing edn, Springer. 

Broderick, T, Dudfk, M., Tkacik, G., Schapire, R. E. & 
Bialek, W. (2007), 'Faster solutions of the inverse 



pairwise Ising problem', http://arxiv.org Identifier: 
0712.2437v2. 



16 



Butts, D. A., Weng, C, Jin, J., Yeh, C.-I. I., Lesica, N. A., 
Alonso, J.-M. M. & Stanley, G. B. (2007), 'Tempo- 
ral precision in the neural code and the timescales of 
natural vision.', Nature 449(7158), 92-5. 

Carr, C. E. (1993), 'Processing of temporal information in 
the brain', Annu Rev Neurosci 16, 223-43. 

Chen, Y., Geisler, W. S. & Seidemann, E. (2006), 'Op- 
timal decoding of correlated neural population re- 
sponses in the primate visual cortex.', Nat Neurosci 
9(11), 1412-20. 

Das, A. & Gilbert, C. D. (1999), 'Topography of con- 
textual modulations mediated by short-range inter- 
actions in primary visual cortex', Nature 399, 643- 
4. 

Ecker, A. S., Berens, P., Keliris, G. A., Bethge, M., Lo- 
gothetis, N. K. & Tolias, A. S. (2010), 'Decorrelated 
Neuronal Firing in Cortical Microcircuits', Science 
327(5965), 584-587. 

Foldiak, P. (1993), The 'ideal homunculus': statistical in- 
ference from neural population responses, in 'Com- 
putation and Neural Systems', Kluwer Academic 
Publishers, Norwell, MA, pp. 55-60. 

Gawne, T. J., Kjaer, T. W., Hertz, J. A. & Richmond, 
B. J. (1996), Adjacent visual cortical complex cells 
share about 20% of their stimulus-related informa- 
tion.', Cereb Cortex 6(3), 482-9. 

Hertz, J., Roudi, Y, Thorning, A., Tyrcha, J., Aurell, E. 
& Zeng, H. L. (2010), 'Inferring network connectiv- 
ity using kinetic Ising models', BMC Neuroscience 
ll(Suppl 1), P51. 

Higham, N. J. (2002), 'Computing the nearest correla- 
tion matrix-a problem from finance', IMA Journal 
of Numerical Analysis 22(3), 329. 

Huang, F. & Ogata, Y. (2001), 'Comparison of two meth- 
ods for calculating the partition functions of var- 
ious spatial statistical models', Australian & New 
Zealand Journal of Statistics 43(1), 47-65. 

Jaynes, E. T. (1957), 'Information Theory and Statistical 
Mechanics', Phys. Rev. 106(4), 620-630. 



Kappen, H. J. & Rodrguez, F. B. (1998), 'Efficient Learn- 
ing in Boltzmann Machines Using Linear Response 
Theory', Neural Computation 10(5), 1137-1156. 

Kohn, A. & Smith, M. A. (2005), 'Stimulus dependence 
of neuronal correlation in primary visual cortex of 
the macaque.', J Neurosci 25(14), 3661-73. 

Kohn, A., Zandvakili, A. & Smith, M. A. (2009), 'Cor- 
relations and brain states: from electrophysiol- 
ogy to functional imaging.', Curr Opin Neurobiol 
19(4), 434-8. 

Macke, J. H., Berens, P., Ecker, A. S., Tolias, A. S. & 
Bethge, M. (2009), 'Generating Spike Trains with 
Specified Correlation Coefficients', Neural Compu- 
tation 21(2), 397^23. 

Mak, J. N. & Wolpaw, J. R. (2009), 'Clinical Applications 
of Brain-Computer Interfaces: Current State and Fu- 
ture Prospects.', IEEE Rev Biomed Eng 2, 187-199. 

Montani, F, Kohn, A., Smith, M. A. & Schultz, S. R. 
(2007), 'The Role of Correlations in Direction and 
Contrast Coding in the Primary Visual Cortex', J. 
Neurosci. 27(9), 2338-2348. 

Nase, G., Singer, W., Monyer, H. & Engel, A. K. (2003), 
'Features of Neuronal Synchrony in Mouse Visual 
Cortex', J Neurophysiol 90(2), 1 1 15-1 123. 

Niell, C. M. & Stryker, M. P. (2008), 'Highly Selective 
Receptive Fields in Mouse Visual Cortex', J. Neu- 
rosci. 28(30), 7520-7536. 

Ogata, Y. & Tanemura, M. (1984), 'Likelihood analysis 
of spatial point patterns', Journal of the royal statis- 
tical Society. Series B. 46(3), 496-518. 

Ohiorhenuan, I. E., Mechler, F, Purpura, K. P., Schmid, 
A. M., Hu, Q. & Victor, J. D. (2010), 'Sparse cod- 
ing and high-order correlations in fine-scale cortical 
networks', Nature 466(7306), 617-621. 

Oram, M. W., Fldik, P., Perrett, D. I., Oram, M. W. & Sen- 
gpiel, F. (1998), 'The 'Ideal Homunculus': decoding 
neural population signals', Trends in Neurosciences 
21(6), 259 - 265. 



17 



Panzeri, S. & Schultz, S. R. (2001), 'A unified approach 
to the study of temporal, correlational, and rate cod- 
ing.', Neural Comput 13(6), 1311-49. 

Panzeri, S., Schultz, S. R., Treves, A. & Rolls, E. T. 
(1999), 'Correlations and the encoding of infor- 
mation in the nervous system', Proceedings of the 
Royal Society of London. Series B: Biological Sci- 
ences 266(1423), 1001-1012. 

Panzeri, S., Treves, A., Schultz, S. & Rolls, E. T. (1999), 
'On decoding the responses of a population of neu- 
rons from short time windows.', Neural Comput 
11(7), 1553-77. 

Petersen, R. S., Panzeri, S. & Diamond, M. E. (2001), 
'Population Coding of Stimulus Location in Rat So- 
matosensory Cortex', Neuron 32(3), 503 - 514. 

Plefka, T. (2006), 'Expansion of the Gibbs potential for 
quantum many-body systems: General formalism 
with applications to the spin glass and the weakly 
nonideal Bose gas', Phys. Rev. £73(1), 016129. 

Pola, G., Thiele, A., Hoffmann, K. & Panzeri, S. (2003), 
'An exact method to quantify the information trans- 
mitted by different mechanisms of correlational 
coding', Network-Computation in Neural Systems 
14(1), 35-60. 

Reich, D. S., Mechler, F. & Victor, J. D. (2001), 'Indepen- 
dent and Redundant Information in Nearby Cortical 
Neurons', Science 294(5551), 2566-2568. 

Renart, A., de la Rocha, J., Bartho, P., Hollender, L., 
Parga, N., Reyes, A. & Harris, K. D. (2010), 'The 
asynchronous state in cortical circuits.', Science 
327(5965), 587-90. 

Roudi, Y., Aurell, E. & Hertz, J. A. (2009), 'Statistical 
physics of pairwise probability models', Frontiers in 
Computational Neuroscience 3:22, 1-15. 

Roudi, Y. & Hertz, J. (2011), 'Mean field theory for non- 
equilibrium network reconstruction', Physical Re- 
view Letters 106, 048702. 

Roudi, Y, Nirenberg, S. & Latham, P. E. (2009), 
'Pairwise Maximum Entropy Models for Study- 
ing Large Biological Systems: When They Can 



Work and When They Can't', PLoS Comput Biol 
5(5), el000380. 

Roudi, Y, Tyrcha, J. & Hertz, J. (2009), Tsing model for 
neural data: Model quality and approximate meth- 
ods for extracting functional connectivity', Phys. 
Rev. E 79(5), 051915 (12 pages). 

Samonds, J. M. & Bonds, A. B. (2005), 'Gamma oscil- 
lation maintains stimulus structure-dependent syn- 
chronization in cat visual cortex.', J Neurophysiol 
93(1), 223-36. 

Santhanam, G., Yu, B. M., Gilja, V., Ryu, S. I., Afshar, A., 
Sahani, M. & Shenoy, K. V. (2009), 'Factor-analysis 
methods for higher-performance neural prostheses.', 
J Neurophysiol 102(2), 1315-30. 

Schneidman, E., Berry II, M. J., Segev, R. & Bialek, W. 
(2006), 'Weak pairwise correlations imply strongly 
correlated network states in a neural population', 
Nature 440(7087), 1007-1012. 

Schultz, S. R., Kitamura, K., Post-Uiterweer, A., Krupic, 
J. & Hausser, M. (2009), 'Spatial Pattern Coding 
of Sensory Information by Climbing Fiber- Evoked 
Calcium Signals in Networks of Neighboring Cere- 
bellar Purkinje Cells', J. Neurosci. 29(25), 8005- 
8015. 

Schultz, S. R. & Panzeri, S. (2001), 'Temporal correla- 
tions and neural spike train entropy.', Phys Rev Lett 
86(25), 5823-6. 

Seiler, H., Zhang, Y, Saleem, A., Bream, P., Apergis- 
Schoute, J. & Schultz, S. R. (2009), 'Maximum en- 
tropy decoding of multivariate neural spike trains', 
BMC Neuroscience 10(Suppl 1), P107. 

Shlens, J., Field, G. D., Gauthier, J. L., Greschner, M., 
Sher, A., Litke, A. M. & Chichilnisky, E. J. (2009), 
'The Structure of Large-Scale Synchronized Firing 
in Primate Retina', J. Neurosci. 29(15), 5022-5031. 

Shlens, J., Field, G. D., Gauthier, J. L., Grivich, 
M. I., Petrusca, D., Sher, A., Litke, A. M. & 
Chichilnisky, E. J. (2006), 'The Structure of Multi- 
Neuron Firing Patterns in Primate Retina', J. Neu- 
rosci. 26(32), 8254-8266. 



18 



Sohl-Dickstein, J., Battaglino, P. & DeWeese, M. R. 
(2009), 'Minimum Probability Flow Learning', 
|http://arxiv.org/abs/0906.4779|arXiv:0906.4779| /2. 

Tanaka, T. (1998), 'Mean-field theory of Boltzmann ma- 
chine learning', Phys. Rev. E 58(2), 2302-2310. 

Thouless, D. J., Anderson, P. W. & Palmer, R. G. (1977), 
'Solution of solvable model of a spin glass', Philos 
Mag 35(3), 593-601. 

Wolpaw, J. R., Birbaumer, N., McFarland, D. J., 
Pfurtscheller, G. & Vaughan, T. M. (2002), 'Brain- 
computer interfaces for communication and con- 
trol.', Clin Neurophysiol 113(6), 767-91. 

Zohary, E. & Shadlen, M. (1994), 'Correlated neuronal 
discharge rate and its implications for psychophysi- 
cal performance', Nature 370(6485), 140-143. 



19 





Figure 1: Tuning curves for a neural ensemble of 50 cells for different heterogeneity parameters 7, £. The shown 
spiking rates correspond to the transient rates, a Tuning curves for basic model, 7 = £ = 0. b Tuning curves with 
heterogeneous firing rates (7 = 1), but fixed tuning widths (£ = 0) (NB the tuning width is defined relative to the 
spontaneous activity), c Tuning curves with heterogeneous tuning widths (£ = 1), but fixed firing rates (7 = 0). d 
Tuning curves for fully heterogeneous scenario (7 = £ = 1) 
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Figure 2: Neural ensemble responses simulated for basic model (7 = £ = 0) for 50 cells, a Correlation matrices for 
4 stimuli (computed with 100000 samples). Diagonal terms set to zero for visualization purposes only, b Simulated 
population neural responses over 1000 trials for two different stimuli, with black indicated a spiking response from a 
neuron on a given trial. 
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Figure 3: Performance of decoding algorithms for the basic model, a Fraction of correct decodings versus neural 
ensemble size, for a training dataset size of 9000 samples per stimulus, b The relationship between fraction correct 
and mutual information 7(s, s) for varying stimulus set and ensemble size (C = {50, 80, 110, 140, 170, 200} varying 
from bottom left to top right for each symbol type). Triangles denote the performance of the Independent decoder, 
squares the TAP Ising Decoder with diagonal weight trick, c The dependence of decoding performance on stimulus 
set size for 70 cells, i TAP, TAPwd, Independent and rMPFL decoders compared to random selection of stimuli. This 
is replotted in ii as the gain in fraction correct over chance performance, 77 = Pdec/Pguess, making the performance 
saturation for the independent decoder as problem difficulty increases more apparent, d Dependence of decoder 
performance on training set sample size, for 4 stimuli and a neural ensemble size of 70. i Fraction correct as a function 
of number of training samples. Below 450 training samples the Ising decoder fails to better the independent decoder, 
ii Relation between mean relative covariance error E as a measure of finite sampling effects and fraction correct. 
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Figure 9: Decoder confusion matrices. The case illustrated is for 16 Stimuli, a 200-neuron ensemble, under the basic 
model (7 = £ = 0). Relative frequency as an approximation to the conditional probability with which each stimulus 
is decoded, for 10000 stimulus presentations, a Ising model decoder (TAPwd). b Independent decoder, c A difference 
plot between a and b. 
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